Background

This is the third assignment for the BCB420H1 course @ U of T. In this assignment, we will be performing non-thresholdeed GSEA on our ranked set of genes that we created in A2, and we will then visualize our GSEA using Cytoscape.

Loading in Libraries

First, let’s load in all of our libraries for the first part of our analysis.

Assignment 1+2 Recap

We are working with the GSE205517 dataset. This dataset and its accompanying study investigates the differentiation of hPSCs into left ventricular cardiomyocytes, and compares them to patient derived samples. (Dark et al. 2023) The study compares the transcriptomes of both conditions in order to determine similarity for potential therapeutic purposes. This study is a SuperSeries containing the SubSeries: - GSE203375, containing the hPSCs - GSE204885, containing the patient samples We will download both. In the previous assignment, we performed normalization on the RNA-Seq data, along with remapping HGNC symbols to avoid collisions. In this assignment, we will be perform differential gene expression (DGE) analysis, along with over-representation analysis (ORA). Before we begin these new steps, we will run the data acquisition, data overview, and normalization steps, and overview stats from assignment 1.

For A2, we continued with the cleaned up data and found DGEs using edgeR. We performed BH corrections, and we also displayed volcano plots to point out key genes (not pictured here). We then used heatmaps to show clustering, and then conducted ORA using g:Profiler for up and downregulated pathways. In terms of our findings:

Data Acquisition

First, let’s perform our data acquisition step using GEOQuery. (Davis and Meltzer 2007) We’ll be re-using code blocks from A1.

# GEO Accession numbers
geo_acc_1 <- "GSE203375"
geo_acc_2 <- "GSE204885"

# The filenames for saving/loading data
filename_1 <- paste0(geo_acc_1, ".RData")
filename_2 <- paste0(geo_acc_2, ".RData")

# Reading in files from the GEO or locally
if (!file.exists(filename_1)) {
  gset_1 <- getGEO(geo_acc_1, GSEMatrix=TRUE, getGPL=FALSE)
  saveRDS(gset_1, filename_1)
} else {
  gset_1 <- readRDS(filename_1)
}
gset_1 <- gset_1[[1]]

if (!file.exists(filename_2)) {
  gset_2 <- getGEO(geo_acc_2, GSEMatrix=TRUE, getGPL=FALSE)
  saveRDS(gset_2, filename_2)
} else {
  gset_2 <- readRDS(filename_2)
}
gset_2 <- gset_2[[1]]

We now have our data loaded in. We now need to retrieve the counts tables from the GEO.

# load counts table from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path_1 <- paste(urld, paste0("acc=", geo_acc_1), paste0("file=", geo_acc_1, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");
path_2 <- paste(urld, paste0("acc=", geo_acc_2), paste0("file=", geo_acc_2, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");

# This checks if we already have the previous matrices saved in or not. If we have a local copy, we don't re-download. This is a time-saving and efficiency measure.
if (!file.exists(paste0(geo_acc_1,"_raw_counts.RData"))) {
  counts_data_1 <- as.matrix(data.table::fread(path_1, header=T, colClasses="integer"), rownames=1)
  name_mapping <- setNames(gset_1@phenoData@data[["title"]], gset_1@phenoData@data[["geo_accession"]])
  colnames(counts_data_1) <- name_mapping[colnames(counts_data_1)]
  saveRDS(counts_data_1, paste0(geo_acc_1, "_raw_counts.RData"))
} else {
  counts_data_1 <- readRDS(paste0(geo_acc_1, "_raw_counts.RData"))
}

# We do the same for our second dataset

if (!file.exists(paste0(geo_acc_2,"_raw_counts.RData"))) {
  counts_data_2 <- as.matrix(data.table::fread(path_2, header=T, colClasses="integer"), rownames=1)
  name_mapping <- setNames(gset_2@phenoData@data[["title"]], gset_2@phenoData@data[["geo_accession"]])
  colnames(counts_data_2) <- name_mapping[colnames(counts_data_2)]
  saveRDS(counts_data_2, paste0(geo_acc_2, "_raw_counts.RData"))
} else {
  counts_data_2 <- readRDS(paste0(geo_acc_2, "_raw_counts.RData"))
}
counts_data <- cbind(counts_data_1, counts_data_2)

Briefly Explaining our Data

This excerpt is directly from A1: Some basic information about the dataset can be obtained through the ExpressionSet we’ve read in. The study investigated differentiation of two different cell lines, H9 of karyotype 46, XX, and MLC2V of karyotype 46, XY, corresponding to cells originating from female and male sources, respectively. It then compared to to patient samples harvested from left and right atrial and ventricular tissue, along with cardiomyocytes individually harvested from each section of the heart from three patients each.

Below are the experimental details provided in the gset objects for both series:

gset_1@experimentData@title # study title
## [1] "Bulk RNA-seq of a time series of hPSCs as they differentiate towards left ventricle cardiomyocytes"
gset_1@experimentData@abstract # study abstract
## [1] "We report that both hESCs and hiPSCs can be differentiated towrads left ventricle cardiomyocytes using an optimised protocol and that these cells transit via first heart field progenitors."
gset_1@experimentData@other$overall_design # summary of experiment
## [1] "Examination of the transcriptome of two hPSC lines as they differentiate towards left ventricle cardiomyocytes; critical time points were collected"
unique(gset_1@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_1@phenoData@data[["experiment number:ch1"]]) # heart
## [1] "h28" "h30" "h32" "h41" "h18" "h42"
unique(gset_1@phenoData@data[["cell_line:ch1"]]) # cell lines
## [1] "H9"    "MLC2V"
unique(gset_1@phenoData@data[["day:ch1"]]) # day of differentiation
## [1] "0"  "2"  "4"  "10" "15" "20" "40" "60" "6"
gset_1@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_1@experimentData@other$last_update_date # last update date
## [1] "Aug 08 2023"
unique(gset_1@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
gset_2@experimentData@title # study title
## [1] "Bulk-RNA-sequencing from chamber specific heart tissue and isolated cardiomyocytes"
gset_2@experimentData@abstract # study abstract
## [1] "Healthy hearts from warm cadavers not usable for heart transplants were identified and tissues were isolated from these. From ventricle tissues, cardiomyocytes were further isolated. Bulk-RNA-sequencing was performed."
gset_2@experimentData@other$overall_design # summary of experiment
## [1] "Bulk-RNA-sequencing from 3 different donors (tissue) and 3 other donors (isolated CMs)"
unique(gset_2@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_2@phenoData@data[["chamber:ch1"]]) # cell lines
## [1] "LV" "RV" "LA" "RA"
unique(gset_2@phenoData@data[["patient:ch1"]]) # patient
## [1] "P1" "P2" "P3" "P4" "P5" "P6"
unique(gset_2@phenoData@data[["tissue:ch1"]]) # tissue
## [1] "Ventricle Isolated Cells" "Heart tissue"
gset_2@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_2@experimentData@other$last_update_date # last update date
## [1] "Apr 26 2023"
unique(gset_2@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
sample_info_1 <- gset_1@phenoData@data[ , (ncol(gset_1@phenoData@data)-2):ncol(gset_1@phenoData@data)]
colnames(sample_info_1) <- gsub(":ch1", "", colnames(sample_info_1)) 
knitr::kable(sample_info_1, format = "pipe", caption = "<b>Table 1:</b> Sample information for Cardiomyocytes Derived from Stem Cells")
Table 1: Sample information for Cardiomyocytes Derived from Stem Cells
cell_line day experiment number
GSM6170585 H9 0 h28
GSM6170586 H9 2 h28
GSM6170587 H9 4 h28
GSM6170588 H9 10 h28
GSM6170589 H9 15 h28
GSM6170590 H9 20 h28
GSM6170591 H9 40 h28
GSM6170592 H9 60 h28
GSM6170593 H9 0 h30
GSM6170594 H9 2 h30
GSM6170595 H9 4 h30
GSM6170596 H9 6 h30
GSM6170597 H9 10 h30
GSM6170598 H9 15 h30
GSM6170599 H9 20 h30
GSM6170600 H9 40 h30
GSM6170601 H9 60 h30
GSM6170602 H9 0 h32
GSM6170603 H9 2 h32
GSM6170604 H9 4 h32
GSM6170605 H9 6 h32
GSM6170606 H9 15 h32
GSM6170607 H9 20 h32
GSM6170608 H9 40 h32
GSM6170609 H9 60 h32
GSM6170610 H9 0 h41
GSM6170611 H9 2 h41
GSM6170612 H9 4 h41
GSM6170613 H9 6 h41
GSM6170614 H9 10 h41
GSM6170615 H9 15 h41
GSM6170616 H9 20 h41
GSM6170617 H9 40 h41
GSM6170618 MLC2V 0 h18
GSM6170619 MLC2V 2 h18
GSM6170620 MLC2V 4 h18
GSM6170621 MLC2V 6 h18
GSM6170622 MLC2V 10 h18
GSM6170623 MLC2V 15 h18
GSM6170624 MLC2V 20 h18
GSM6170625 MLC2V 40 h18
GSM6170626 MLC2V 60 h18
GSM6170627 MLC2V 0 h42
GSM6170628 MLC2V 2 h42
GSM6170629 MLC2V 4 h42
GSM6170630 MLC2V 6 h42
GSM6170631 MLC2V 10 h42
GSM6170632 MLC2V 15 h42
GSM6170633 MLC2V 20 h42
GSM6170634 MLC2V 40 h42
GSM6170635 MLC2V 60 h42
sample_info_2 <- gset_2@phenoData@data[ , (ncol(gset_2@phenoData@data)-2):ncol(gset_2@phenoData@data)]
colnames(sample_info_2) <- gsub(":ch1", "", colnames(sample_info_2)) 
knitr::kable(sample_info_2, format = "pipe", caption = "<b>Table 2:</b> Sample information for Patient-Derived Samples")
Table 2: Sample information for Patient-Derived Samples
chamber patient tissue
GSM6199297 LV P1 Ventricle Isolated Cells
GSM6199298 RV P1 Ventricle Isolated Cells
GSM6199299 LV P2 Ventricle Isolated Cells
GSM6199300 RV P2 Ventricle Isolated Cells
GSM6199301 LV P3 Ventricle Isolated Cells
GSM6199302 RV P3 Ventricle Isolated Cells
GSM6199303 LA P4 Heart tissue
GSM6199304 LV P4 Heart tissue
GSM6199305 RA P4 Heart tissue
GSM6199306 RV P4 Heart tissue
GSM6199307 LA P5 Heart tissue
GSM6199308 LV P5 Heart tissue
GSM6199309 RA P5 Heart tissue
GSM6199310 RV P5 Heart tissue
GSM6199311 LA P6 Heart tissue
GSM6199312 LV P6 Heart tissue
GSM6199313 RA P6 Heart tissue
GSM6199314 RV P6 Heart tissue

Data-Cleaning

To clean our data, we will do two things: 1. Filter out zero-expressors 2. Perform HGNC re-mapping 3. Normalize the data

Filtering Out Zero-Expressors

From A1, we know that there are genes that minimally express throughout different conditions, and as a result, we will be filtering those out.

# Pick out all rows that have zeroes across all conditions
rows_to_keep <- apply(counts_data, 1, function(row) any(row != 0))

# Create new matrix containing only the rows that aren't all zeroes
counts_data_zero_filtered <- counts_data[rows_to_keep, ]

Data Normalization

In line with the experimenters’ analyses, we will normalize the data together

First, let’s perform CPM filtering to remove low expressors.

# Minimum number of samples for the hPSC run
min_num_samples <- 2
counts_data_zero_filtered_matrix <- as.matrix(counts_data_zero_filtered)

# Get rid of low counts
# We add 0.1 for numerical stability and to prevent NA values when evaluating logs
keep = rowSums(log2(cpm(counts_data_zero_filtered_matrix+0.1)) > 1) > min_num_samples
filtered_counts_matrix = counts_data_zero_filtered_matrix[keep,]

After, we will use TMM from edgeR (Robinson, McCarthy, and Smyth 2010) to perform our normalization steps.

# We will make our DGEList with the filtered count matrices
d <- DGEList(counts=filtered_counts_matrix)

# Calculate normalization factors
d <- calcNormFactors(d)

# Apply normalization
normalized_counts <- cpm(d)

Normalization is now done, and we can move onto HGNC mapping.

HGNC Mapping

The protocol followed for HGNC mapping is nearly identical to the first assignment. First, we will add a column to our normalized with the appropriate NCBI gene ID. While we do have the annotation table, it is also true that we were specified to still manually perform this step. We will download a copy of the HGNC dataset. (Tweedie et al. 2020) The dataset is confirmed to be up-to-date with the date of download.

# Download the HGNC file if it doesn't already exist in the cwd
if (!file.exists("hgnc_genes.RData")) {
  hgnc_genes <- import_hgnc_dataset(file=latest_archive_url())
  saveRDS(hgnc_genes, "hgnc_genes.RData")
} else {
  hgnc_genes <- readRDS("hgnc_genes.RData")
}
hgnc_mapping <- hgnc_genes[, c('symbol', 'entrez_id')]

Now that we have a mapping, we can apply this to the table.

# Convert the normalized counts into a dataframe
normalized_counts_df <- as.data.frame(normalized_counts)

# Add rownames
normalized_counts_df$NCBI_gene_id <- rownames(normalized_counts_df)

# Map the entrez_id to the NCBI_gnes
labelled_counts_data <- merge(normalized_counts_df, hgnc_mapping, by.x = "NCBI_gene_id", by.y = 'entrez_id', all.x = TRUE)
labelled_counts_data <- labelled_counts_data[,c(1, ncol(labelled_counts_data), 3:ncol(labelled_counts_data)-1)]

knitr::kable(labelled_counts_data[c(1:10), c(1,2)], format = "pipe", caption = "<b>Table 3:</b> Table Matching First 10 NCBI Gene IDs to the Approved HGNC Symbols")
Table 3: Table Matching First 10 NCBI Gene IDs to the Approved HGNC Symbols
NCBI_gene_id symbol
1 A1BG
100 ADA
1000 CDH2
10000 AKT3
100008587 RNA5-8SN5
100008588 RNA18SN5
100008589 RNA28SN5
100009676 ZBTB11-AS1
10001 MED6
10003 NAALAD2
labelled_counts_data <- labelled_counts_data[, -c(1)]

Now, we must deal with NA and empty values, along with many-to-one mappings.

# Remove all NA labels
no_na_labelled_counts_data <- labelled_counts_data[!is.na(labelled_counts_data$symbol), ]

# Remove all empty strings symbols
no_emp_labelled_counts_data <- no_na_labelled_counts_data[no_na_labelled_counts_data$symbol != '', ]

rownames(no_emp_labelled_counts_data) <- no_emp_labelled_counts_data[, 1]
final_normalized_data <- no_emp_labelled_counts_data[, -c(1)]

Overview Statistics

Now that we have our data, we can look at a few counting statistics to gain insight into what we’re looking at.

# Create a list of overview stats
overview_stats <- list()

# Do it for each sample
for (sample_name in colnames(final_normalized_data)) {
  sample_data <- final_normalized_data[, sample_name]
  
  total_counts <- sum(sample_data)
  mean_counts <- mean(sample_data)
  median_counts <- median(sample_data)
  sd_counts <- sd(sample_data)
  var_counts <- var(sample_data)
  overview_stats[[sample_name]] <- c(total_counts, mean_counts, median_counts, sd_counts, var_counts)
}

overview_stats_df <- do.call(rbind, overview_stats)
rownames(overview_stats_df) <- names(overview_stats)
colnames(overview_stats_df) <- c("Total Counts", "Mean Counts", "Median Counts", "Counts Standard Deviation", "Counts Variation")

knitr::kable(overview_stats_df[, ], format = "pipe", caption = "<b>Table 4:</b> Overview Statistics of Normalized Samples. The total counts are somewhat similar, but the mean counts are lower in the hPSC-derived samples. Mean counts are somewhat similar in both, but the median counts are greater in the hPSCs. The standard deviation and variation in the hPSCs is far lesser than that of the patient samples.")
Table 4: Overview Statistics of Normalized Samples. The total counts are somewhat similar, but the mean counts are lower in the hPSC-derived samples. Mean counts are somewhat similar in both, but the median counts are greater in the hPSCs. The standard deviation and variation in the hPSCs is far lesser than that of the patient samples.
Total Counts Mean Counts Median Counts Counts Standard Deviation Counts Variation
Heart 28 Day 0 H9 1021771.9 59.98426 12.24620 431.8453 186490.34
Heart 28 Day 2 H9 786599.4 46.17819 13.63105 154.4348 23850.12
Heart 28 Day 4 H9 716839.2 42.08285 13.82826 125.4171 15729.45
Heart 28 Day 10 H9 793482.4 46.58227 13.73218 236.6323 55994.84
Heart 28 Day 15 H9 745914.9 43.78977 14.72937 195.7736 38327.31
Heart 28 Day 20 H9 993073.6 58.29949 14.57508 340.7407 116104.22
Heart 28 Day 40 H9 1023479.0 60.08448 14.23242 342.5525 117342.18
Heart 28 Day 60 H9 829562.9 48.70042 14.76538 223.1736 49806.45
Heart 30 Day 0 H9 863916.5 50.71719 12.75642 218.8989 47916.72
Heart 30 Day 2 H9 771994.0 45.32077 13.16358 169.0652 28583.03
Heart 30 Day 4 H9 699609.0 41.07133 13.84685 136.2720 18570.05
Heart 30 Day 6 H9 786225.2 46.15623 14.09130 184.6986 34113.57
Heart 30 Day 10 H9 761009.1 44.67589 14.40243 189.8644 36048.49
Heart 30 Day 15 H9 908915.2 53.35888 13.90010 360.0973 129670.08
Heart 30 Day 20 H9 887759.9 52.11694 13.69645 273.7729 74951.58
Heart 30 Day 40 H9 1010387.1 59.31591 13.56070 489.7979 239901.95
Heart 30 Day 60 H9 920475.0 54.03751 14.47831 373.9279 139822.08
Heart 32 Day 0 H9 779406.7 45.75594 13.18829 149.3232 22297.42
Heart 32 Day 2 H9 770684.2 45.24387 13.37835 153.5595 23580.52
Heart 32 Day 4 H9 702719.5 41.25393 13.64478 129.9497 16886.93
Heart 32 Day 6 H9 772659.8 45.35985 14.03981 170.6316 29115.14
Heart 32 Day 15 H9 844624.9 49.58465 13.86116 271.0413 73463.40
Heart 32 Day 20 H9 831683.9 48.82493 14.49759 253.6298 64328.10
Heart 32 Day 40 H9 805256.4 47.27347 15.13632 206.4093 42604.78
Heart 32 Day 60 H9 1339814.7 78.65532 12.68147 1102.2826 1215026.94
Heart 41 Day 0 H9 747252.1 43.86827 13.25882 164.7467 27141.46
Heart 41 Day 2 H9 771285.9 45.27920 13.46844 181.9923 33121.19
Heart 41 Day 4 H9 727104.3 42.68547 13.48812 172.0213 29591.32
Heart 41 Day 6 H9 744472.3 43.70508 14.23815 176.5771 31179.46
Heart 41 Day 10 H9 800151.2 46.97377 14.00636 311.0775 96769.22
Heart 41 Day 15 H9 927299.3 54.43814 13.63959 425.1034 180712.93
Heart 41 Day 20 H9 809279.0 47.50963 14.63262 230.8466 53290.15
Heart 41 Day 40 H9 826540.3 48.52297 15.18484 211.2894 44643.22
Heart 18 Day 0 MLC2V 802319.7 47.10108 13.25450 158.2561 25045.00
Heart 18 Day 2 MLC2V 764355.6 44.87235 13.46746 169.3843 28691.04
Heart 18 Day 4 MLC2V 707001.2 41.50530 14.09808 150.1960 22558.85
Heart 18 Day 6 MLC2V 769018.6 45.14609 14.25261 169.0281 28570.48
Heart 18 Day 10 MLC2V 791369.4 46.45822 14.52728 219.7194 48276.59
Heart 18 Day 15 MLC2V 791232.2 46.45017 14.62889 233.2285 54395.55
Heart 18 Day 20 MLC2V 899519.9 52.80732 14.82049 244.6640 59860.47
Heart 18 Day 40 MLC2V 935508.8 54.92009 14.67489 361.6569 130795.68
Heart 18 Day 60 MLC2V 933355.7 54.79369 14.68259 421.3648 177548.34
Heart 42 Day 0 MLC2V 816656.5 47.94273 13.07401 196.8750 38759.78
Heart 42 Day 2 MLC2V 789418.7 46.34370 13.20376 183.5629 33695.34
Heart 42 Day 4 MLC2V 730009.3 42.85601 13.51457 162.4742 26397.86
Heart 42 Day 6 MLC2V 779092.2 45.73748 13.98386 184.5081 34043.25
Heart 42 Day 10 MLC2V 860421.6 50.51201 14.31508 232.9254 54254.26
Heart 42 Day 15 MLC2V 775917.9 45.55113 14.74511 236.1324 55758.50
Heart 42 Day 20 MLC2V 849629.4 49.87844 14.70218 252.2778 63644.10
Heart 42 Day 40 MLC2V 931343.7 54.67557 14.76092 407.5923 166131.48
Heart 42 Day 60 MLC2V 972308.9 57.08048 15.28911 496.6643 246675.41
Patient 1 Left Ventricle Isolated Cells 2097613.3 123.14273 12.71544 2054.7924 4222171.89
Patient 1 Right Ventricle Isolated cells 1801820.6 105.77789 12.64497 1652.3177 2730153.91
Patient 2 Left Ventricle Isolated Cells 2448968.4 143.76943 11.90376 2919.5824 8523961.52
Patient 2 Right Ventricle Isolated Cells 2338963.9 137.31149 11.48868 2609.5704 6809857.69
Patient 3 Left Ventricle Isolated Cells 1955785.5 114.81657 12.28001 1785.9598 3189652.30
Patient 3 Right Ventricle Isolated Cells 1841091.5 108.08333 12.63626 1779.0770 3165115.09
Patient 4 Left Atria Tissue 1289436.8 75.69783 15.15308 1052.6384 1108047.51
Patient 4 Left Ventricle Tissue 1993945.7 117.05681 14.52426 2527.4069 6387785.88
Patient 4 Right Atria Tissue 1289380.0 75.69449 14.74821 1068.7044 1142129.05
Patient 4 Right Ventricle Tissue 1729378.7 101.52511 13.97466 1684.3179 2836926.68
Patient 5 Left Atria Tissue 1374196.6 80.67374 14.72154 1212.6323 1470477.14
Patient 5 Left Ventricle Tissue 1610446.1 94.54304 14.15593 1607.2241 2583169.31
Patient 5 Right Atria Tissue 1383940.8 81.24579 15.08133 1388.4255 1927725.46
Patient 5 Right Ventricle Tissue 1649782.1 96.85230 14.02391 1703.4872 2901868.80
Patient 6 Left Atria Tissue 1432786.4 84.11333 14.46312 1383.8852 1915138.25
Patient 6 Left Ventricle Tissue 1393964.4 81.83424 14.18233 1300.1378 1690358.21
Patient 6 Right Atria Tissue 1273212.8 74.74538 14.75168 1055.1550 1113352.02
Patient 6 Right Ventricle Tissue 1460262.8 85.72636 14.21627 1364.0782 1860709.32

We can see that there is strong clustering of all the patient-derived samples, and there’s also temporal clustering of all the stem cell-derived CMs. This is evidenced by the fact the from left to right, we see a ‘continuous’ clustering of cardiomyocytes from d60 to d0. We see that the patient samples all cluster together quite strongly, but further away in the firsta and second dimensions. This can likely be explained due to higher variance and low median counts in the patient data, but the fact that the cells came from warm cadavers, and the cell niche The range for the MDS Leading logFC dimensions are also quite high, which could be explained due to having two different types of data here.

Differential Gene Expression

We will now prefer differential gene expression analyses. Based on the results of the previously generated MDS plots, I think that important analyses would be: * Pairwise comparisons between each time point of the hPSC to cardiomyocyte differentiation + Specifically a comparison between day 0 and day 60 of the stem cell experiments - Comparison of the day 60 * hPSC-cardiomyocyte to the rest of the patient heart cells

Determining p-Values

We will determine the p-values of each of the experiment using edgeR. (Robinson, McCarthy, and Smyth 2010)

# 1. Prepare the data

# Truncate the sample names
converted_names <- gsub("Heart \\d+ Day (\\d+) .*", "day \\1", colnames(final_normalized_data))
converted_names <- gsub("Patient \\d+ (Left|Right) (Ventricle|Atria) .*", "\\1 \\2", converted_names)
converted_names <- gsub(" ", "", converted_names)
# Create a data frame with sample information
samples <- data.frame(converted_names)

# Convert the count data to a DGEList object
dge <- DGEList(counts = final_normalized_data)

# 2. Create the design matrix
# Create the design matrix with only 'day' as a factor
design <- model.matrix(~0 + factor(samples$converted_names))

# Rename the columns to be more interpretable
colnames(design) <- gsub("factor\\(samples\\$converted_names\\)", "", colnames(design))

dge <- estimateDisp(dge, design)

# 4. Fit the model
fit <- glmQLFit(dge, design)

Now, we will create the contrast matrices for comparisons.

# 5. Create contrast matrices
twoVS0con <- makeContrasts(
    day2vsday0 = day2 - day0,
    levels = design
)
fourVS2con <- makeContrasts(
    day4vsday2 = day4 - day2,
    levels = design
)
sixVS4con <- makeContrasts(
    day6vsday4 = day6 - day4,
    levels = design
)
tenVS6con <- makeContrasts(
    day10vsday6 = day10 - day6,
    levels = design
)
fifteenVS10con <- makeContrasts(
    day15vsday10 = day15 - day10,
    levels = design
)
twentyVS15con <- makeContrasts(
    day20vsday15 = day20 - day15,
    levels = design
)
fortyVS20con <- makeContrasts(
    day40vsday20 = day40 - day20,
    levels = design
)
sixtyVS40con <- makeContrasts(
    day60vsday40 = day60 - day40,
    levels = design
)
sixtyVS0con <- makeContrasts(
    day60vsday0 = day60 - day0,
    levels = design
)
lvVS60 <- makeContrasts(
    LVvsday0 = LeftVentricle - day60,
    levels = design
)
rvVS60 <- makeContrasts(
    RVvsday60 = RightVentricle - day60,
    levels = design
)
laVS60 <- makeContrasts(
    LAvsday60 = LeftAtria - day60,
    levels = design
)
raVS60 <- makeContrasts(
    RAvsday60 = RightAtria - day60,
    levels = design
)
lvVSrv <- makeContrasts(
    lvvsrv = LeftVentricle - RightVentricle,
    levels = design
)

Now, we can find differential expression between conditions

# 6. Perform the tests
twoVS0_lrt <- glmQLFTest(fit, contrast = twoVS0con)
fourVS2_lrt <- glmQLFTest(fit, contrast = fourVS2con)
sixVS4_lrt <- glmQLFTest(fit, contrast = sixVS4con)
tenVS6_lrt <- glmQLFTest(fit, contrast = tenVS6con)
fifteenVS10_lrt <- glmQLFTest(fit, contrast = fifteenVS10con)
twentyVS15_lrt <- glmQLFTest(fit, contrast = twentyVS15con)
fortyVS20_lrt <- glmQLFTest(fit, contrast = fortyVS20con)
sixtyVS40_lrt <- glmQLFTest(fit, contrast = sixtyVS40con)
sixtyVS0_lrt <- glmQLFTest(fit, contrast = sixtyVS0con)
lvVS60_lrt <- glmQLFTest(fit, contrast = lvVS60)
rvVS60_lrt <- glmQLFTest(fit, contrast = rvVS60)
laVS60_lrt <- glmQLFTest(fit, contrast = laVS60)
raVS60_lrt <- glmQLFTest(fit, contrast = raVS60)
lvVSrv_lrt <- glmQLFTest(fit, contrast = lvVSrv)

# 7. Extract results
results_twoVS0 <- topTags(twoVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fourVS2 <- topTags(fourVS2_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixVS4 <- topTags(sixVS4_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_tenVS6 <- topTags(tenVS6_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fifteenVS10 <- topTags(fifteenVS10_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_twentyVS15 <- topTags(twentyVS15_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fortyVS20 <- topTags(fortyVS20_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS40 <- topTags(sixtyVS40_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS0 <- topTags(sixtyVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVS60 <- topTags(lvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_rvVS60 <- topTags(rvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_laVS60 <- topTags(laVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_raVS60 <- topTags(raVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVSrv <- topTags(lvVSrv_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")

I chose to use a cut-off of 0.01, in-line with the original study, and the limit the amount of genes that show up as significant, as there is a lot of temporal variation in stem cell differentiation protocols. In terms of the multiple hypothesis correction, I chose to use Benjamini-Hochberg for its robustness and that it seems quite standard in such analyses.

We will now present summary data in a table for each comparison.

summary_table <- data.frame(
  Comparison = c("Day 2 VS Day 0",
                 "Day 4 VS Day 2",
                 "Day 6 VS Day 4",
                 "Day 10 VS Day 6",
                 "Day 15 VS Day 10",
                 "Day 20 VS Day 15",
                 "Day 40 VS Day 20",
                 "Day 60 VS Day 40",
                 "Day 60 VS Day 0",
                 "LV VS Day 60",
                 "RV VS Day 60",
                 "LA VS Day 60",
                 "RA VS Day 60",
                 "LV VS RV"
                 ),
  Significant = c(length(which(results_twoVS0$table$PValue < 0.01)),
                  length(which(results_fourVS2$table$PValue < 0.01)),
                  length(which(results_sixVS4$table$PValue < 0.01)),
                  length(which(results_tenVS6$table$PValue < 0.01)),
                  length(which(results_fifteenVS10$table$Value < 0.01)),
                  length(which(results_twentyVS15$table$PValue < 0.01)),
                  length(which(results_fortyVS20$table$PValue < 0.01)),
                  length(which(results_sixtyVS40$table$PValue < 0.01)),
                  length(which(results_sixtyVS0$table$PValue < 0.01)),
                  length(which(results_lvVS60$table$PValue < 0.01)),
                  length(which(results_rvVS60$table$PValue < 0.01)),
                  length(which(results_laVS60$table$PValue < 0.01)),
                  length(which(results_raVS60$table$PValue < 0.01)),
                  length(which(results_lvVSrv$table$PValue < 0.01))
                  ),
  Corrected = c(length(which(results_twoVS0$table$FDR < 0.01)),
                length(which(results_fourVS2$table$FDR < 0.01)),
                length(which(results_sixVS4$table$FDR < 0.01)),
                length(which(results_tenVS6$table$FDR < 0.01)),
                length(which(results_fifteenVS10$table$FDR < 0.01)),
                length(which(results_twentyVS15$table$FDR < 0.01)),
                length(which(results_fortyVS20$table$FDR < 0.01)),
                length(which(results_sixtyVS40$table$FDR < 0.01)),
                length(which(results_sixtyVS0$table$FDR < 0.01)),
                length(which(results_lvVS60$table$FDR < 0.01)),
                length(which(results_rvVS60$table$FDR < 0.01)),
                length(which(results_laVS60$table$FDR < 0.01)),
                length(which(results_raVS60$table$FDR < 0.01)),
                length(which(results_lvVSrv$table$FDR < 0.01))
                )
)

kable(summary_table, caption = "Table 5: The number of genes that show the number of significantly differentially expressed genes before and after FDR correction. For the temporal analysis, the first two comparisons hover around 1800 genes, the next two are at around 900, and then there's a steep drop off at the day 15 vs 10 and day 20 vs 15 comparisons. This picks back up. This picks back up intil days 60 to 40, showing minimal change, suggesting that they're in late stage maturation. Between day 60 and patient samples, many genes, all above 6000, are significantly differentially expressed. Between ventricles, no change is observed. ")
Table 5: The number of genes that show the number of significantly differentially expressed genes before and after FDR correction. For the temporal analysis, the first two comparisons hover around 1800 genes, the next two are at around 900, and then there’s a steep drop off at the day 15 vs 10 and day 20 vs 15 comparisons. This picks back up. This picks back up intil days 60 to 40, showing minimal change, suggesting that they’re in late stage maturation. Between day 60 and patient samples, many genes, all above 6000, are significantly differentially expressed. Between ventricles, no change is observed.
Comparison Significant Corrected
Day 2 VS Day 0 3230 1895
Day 4 VS Day 2 3006 1774
Day 6 VS Day 4 2056 866
Day 10 VS Day 6 2301 1006
Day 15 VS Day 10 0 64
Day 20 VS Day 15 1641 416
Day 40 VS Day 20 2501 1163
Day 60 VS Day 40 464 15
Day 60 VS Day 0 9281 8768
LV VS Day 60 10698 10290
RV VS Day 60 10423 9999
LA VS Day 60 7281 6360
RA VS Day 60 7082 6126
LV VS RV 17 0

As expected, the comparison with the greatest number of differentially expressed genes was between day 60 and day 0, which compares an hPSC to a left ventricular cardiomyocyte derived from stem cells. Interestingly, we see a very small amount of differentially expressed genes for day 15 vs day 10 and day 60 and day 40. In terms of the patient sample comparisons, we see that all of them have a great number of genes with a significant values of differential expression, which indicates that even the mature hPSC-derived CM is very far from the patient by transcriptome. This could be explained due to the niche of the cell–cardiomyocytes in vivo are constantly beating, working to pump blood, and are also surrounded by other tissue types which secrete signals to regulate expression of different transcripts. In a dish, this isn’t the case, which could explain this phenomenon. We can investigate this more closely once we see which genes come out as top hits.

We will save our results in a list for ease of downstream analysis.

# List of all the results objects
results_list <- list(
  results_twoVS0 = results_twoVS0,
  results_fourVS2 = results_fourVS2,
  results_sixVS4 = results_sixVS4,
  results_tenVS6 = results_tenVS6,
  results_fifteenVS10 = results_fifteenVS10,
  results_twentyVS15 = results_twentyVS15,
  results_fortyVS20 = results_fortyVS20,
  results_sixtyVS40 = results_sixtyVS40,
  results_sixtyVS0 = results_sixtyVS0,
  results_lvVS60 = results_lvVS60,
  results_rvVS60 = results_rvVS60,
  results_laVS60 = results_laVS60,
  results_raVS60 = results_raVS60,
  results_lvVSrv = results_lvVSrv
)

# Titles for each plot
titles <- c(
  "Day 2 vs Day 0",
  "Day 4 vs Day 2",
  "Day 6 vs Day 4",
  "Day 10 vs Day 6",
  "Day 15 vs Day 10",
  "Day 20 vs Day 15",
  "Day 40 vs Day 20",
  "Day 60 vs Day 40",
  "Day 60 vs Day 0",
  "LV vs Day 60",
  "RV vs Day 60",
  "LA vs Day 60",
  "RA vs Day 60",
  "LV vs RV"
)

Filtering for Top Hits

We will now filter with the same parameters used to filter in the volcano plots.

# Filter each comparison based on the specified criteria
filtered_results_list <- lapply(results_list, function(result) {
  result$table %>%
    filter(FDR < 0.01, abs(logFC) > 2)
})

# Names for the filtered results
names(filtered_results_list) <- names(results_list)

filtered_results_display_df <- data.frame(
  Num_Genes = sapply(filtered_results_list, nrow)
)

kable(filtered_results_display_df, caption = "Table 6: Number of genes meeting criteria for each comparison. Notice that because of the logFC conition, all values are necessarily equal to or smaller than the values in Table 5.")
Table 6: Number of genes meeting criteria for each comparison. Notice that because of the logFC conition, all values are necessarily equal to or smaller than the values in Table 5.
Num_Genes
results_twoVS0 563
results_fourVS2 616
results_sixVS4 368
results_tenVS6 251
results_fifteenVS10 20
results_twentyVS15 132
results_fortyVS20 323
results_sixtyVS40 9
results_sixtyVS0 3720
results_lvVS60 3836
results_rvVS60 3762
results_laVS60 2369
results_raVS60 2289
results_lvVSrv 0

Filtering for All and Unique Differentially Expressed Genes

# Extract gene names from each filtered result and combine them into a single vector
union_gene_names <- unlist(lapply(filtered_results_list, function(filtered_result) {
  rownames(filtered_result)
}))

# Remove duplicate gene names
all_genes <- unique(union_gene_names)

We will also need single genes for downstream analysis.

# Get only the unique genes for downstream analysis
gene_counts <- table(union_gene_names)
unique_genes <- names(gene_counts[gene_counts == 1])

Over-Representation Analysis

Defining Up and Down-Regulated Gene Sets

Here, we will define up and down-regulated gene sets. The goal of this function block was to create a general overall list of genes that were up and down-regulated for each condition, and to create a pooled one as well. In terms of significance values, we’ll use a cut-off of FDR < 0.05 and the magnitude of logFC being 0. These are the same parameters the authors used in their analysis.

ora_filtered_results_list <- lapply(results_list, function(result) {
  result$table %>%
    filter(FDR < 0.05)
})

up_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
  result %>%
    filter(logFC > 0)
})

up_regulated_results_list <- lapply(up_regulated_results_list, function(result) {
  rownames(result)
})

down_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
  result %>%
    filter(logFC < 0)
})

down_regulated_results_list <- lapply(down_regulated_results_list, function(result) {
  rownames(result)
})

all_up_regulated_genes <- unique(unlist(up_regulated_results_list))
all_down_regulated_genes <- unique(unlist(down_regulated_results_list))

Setting up the g:Profiler

Let’s set up the g:Profiler for all of the aforementioned combinations (per condition, for significant, up and down-regulated genes). (Reimand et al. 2007) We will be using the g:Profiler since it’s a popular and robust tool for ORA, and it was also a technique learned in a previous homework assignment. The paper only uses the GO BP annotation dataset, and we will be doing the same. This dataset …

go_upreg <- gost(
  query = all_up_regulated_genes, 
  organism = "hsapiens", 
  sources = c("GO:BP")
  )

go_downreg <- gost(
  query = all_down_regulated_genes, 
  organism = "hsapiens", 
  sources = c("GO:BP")
  )

go_updownreg <- gost(
  query = list(Downregulated = all_down_regulated_genes, Upregulated = all_up_regulated_genes), 
  organism = "hsapiens", 
  sources = c("GO:BP")
  )

Analyzing All Up and Down-Regulated Genes, Pooled

We can see that running the analysis together or separately still yields the same number of total pathways being recognized.

Let’s look at a few up-regulated terms.

We will filter pathways with greater than 1000 terms, since they could be ‘ubiquitous’ and yield false positives. Start with up-regulation.

filtered_go_upreg <- go_updownreg$result[
  which(go_updownreg$result$query == "Upregulated" & go_updownreg$result$term_size <= 1000 ),]

And now down-regulation.

filtered_go_downreg <- go_updownreg$result[
  which(go_updownreg$result$query == "Downregulated" & go_updownreg$result$term_size <= 1000 ),]

Analyzing for Pairwise Comparisons

We will extract up and down-regualted genes using the same conditions for all conditions.

# Initialize lists to hold the gene sets
upregulated_genesets <- list()
downregulated_genesets <- list()

# Iterate over the filtered results list
for (condition_name in names(ora_filtered_results_list)) {
  # Extract the data frame for the condition
  condition_df <- ora_filtered_results_list[[condition_name]]
  
  # Filter for upregulated genes (FDR < 0.05 and logFC > 0)
  upregulated_genes <- condition_df %>% 
    filter(FDR < 0.05, logFC > 0) %>% 
    rownames()
  
  # Filter for downregulated genes (FDR < 0.05 and logFC < 0)
  downregulated_genes <- condition_df %>% 
    filter(FDR < 0.05, logFC < 0) %>% 
    rownames()
  
  # Add to the lists
  upregulated_genesets[[condition_name]] <- upregulated_genes
  downregulated_genesets[[condition_name]] <- downregulated_genes
}

First, we’ll look at the temporal comparisons.

# Initialize an empty list to hold the GO analysis for each comparison
days_analysis_list <- vector("list", 9)

# Iterate over each comparison
for (i in 1:9) {
  cond <- names(ora_filtered_results_list)[i]
  # Perform g:Profiler analysis for upregulated and downregulated genes separately
  go_analysis <- gost(
    query = list(Downregulated = downregulated_genesets[[cond]],
                 Upregulated = upregulated_genesets[[cond]]),
    organism = "hsapiens",
    sources = c("GO:BP")
  )
  
  # Add the results to the list
  days_analysis_list[[cond]] <- go_analysis
}

# Remove the first 9 empty entries
days_analysis_list <- days_analysis_list[-(1:9)]

And now the patient ones.

# Initialize an empty list to hold the GO analysis for each comparison
patient_analysis_list <- vector("list", 4)

# Iterate over each comparison
for (i in 10:13) {
  cond <- names(ora_filtered_results_list)[i]
  # Perform g:Profiler analysis for upregulated and downregulated genes separately
  go_analysis <- gost(
    query = list(Downregulated = downregulated_genesets[[cond]],
                 Upregulated = upregulated_genesets[[cond]]),
    organism = "hsapiens",
    sources = c("GO:BP")
  )
  
  # Add the results to the list
  patient_analysis_list[[cond]] <- go_analysis
}

# Remove the first 4 empty entries
patient_analysis_list <- patient_analysis_list[-(1:4)]

We will now display both.

# Initialize an empty data frame to hold the top pathways for each condition and direction
top_pathways_df <- data.frame(matrix(ncol = length(names(days_analysis_list)) * 2, nrow = 10))

# Set the column names for the data frame
colnames(top_pathways_df) <- paste(rep(names(days_analysis_list), each = 2), rep(c("Upregulated", "Downregulated"), times = length(names(days_analysis_list))), sep = "_")

# Fill the data frame with the top pathways
for (cond in names(days_analysis_list)) {
  # Extract the top 10 upregulated pathways
  top_upreg <- days_analysis_list[[cond]]$result %>%
    filter(query == "Upregulated", term_size <= 1000) %>%
    arrange(p_value) %>%
    slice(1:10) %>%
    .$term_name

  # Ensure that we have exactly 10 entries by padding with NA if necessary
  top_upreg <- top_upreg[1:10]
  if (length(top_upreg) < 10) {
    top_upreg <- c(top_upreg, rep(NA, 10 - length(top_upreg)))
  }

  # Extract the top 10 downregulated pathways
  top_downreg <- days_analysis_list[[cond]]$result %>%
    filter(query == "Downregulated", term_size <= 1000) %>%
    arrange(p_value) %>%
    slice(1:10) %>%
    .$term_name

  # Ensure that we have exactly 10 entries by padding with NA if necessary
  top_downreg <- top_downreg[1:10]
  if (length(top_downreg) < 10) {
    top_downreg <- c(top_downreg, rep(NA, 10 - length(top_downreg)))
  }

  # Add the pathways to the data frame
  top_pathways_df[paste(cond, "Upregulated", sep = "_")] <- top_upreg
  top_pathways_df[paste(cond, "Downregulated", sep = "_")] <- top_downreg
}

# Create the kable
kable(top_pathways_df, caption = "Table 7: Top 10 upregulated and downregulated pathways for each temporal comparison. Intepretation of these results are in the Interpretation section.", align = 'c')
Table 7: Top 10 upregulated and downregulated pathways for each temporal comparison. Intepretation of these results are in the Interpretation section.
results_twoVS0_Upregulated results_twoVS0_Downregulated results_fourVS2_Upregulated results_fourVS2_Downregulated results_sixVS4_Upregulated results_sixVS4_Downregulated results_tenVS6_Upregulated results_tenVS6_Downregulated results_fifteenVS10_Upregulated results_fifteenVS10_Downregulated results_twentyVS15_Upregulated results_twentyVS15_Downregulated results_fortyVS20_Upregulated results_fortyVS20_Downregulated results_sixtyVS40_Upregulated results_sixtyVS40_Downregulated results_sixtyVS0_Upregulated results_sixtyVS0_Downregulated
regulation of anatomical structure morphogenesis neuron projection morphogenesis heart development ribosome biogenesis heart development DNA replication muscle system process chromosome organization muscle system process cellular response to growth factor stimulus actin cytoskeleton organization cytoplasmic translation external encapsulating structure organization mitotic cell cycle NA fructose metabolic process muscle structure development ribosome biogenesis
embryonic morphogenesis plasma membrane bounded cell projection morphogenesis muscle structure development ncRNA metabolic process muscle structure development DNA-templated DNA replication muscle contraction mitotic cell cycle striated muscle contraction response to growth factor actin filament-based process peptide metabolic process extracellular matrix organization mitotic cell cycle process NA carbohydrate phosphorylation heart development ncRNA metabolic process
tissue morphogenesis cell projection morphogenesis tissue morphogenesis rRNA metabolic process muscle system process DNA repair striated muscle contraction mitotic cell cycle process actin filament-based movement regulation of cell migration supramolecular fiber organization translation extracellular structure organization chromosome organization NA fructose 2,6-bisphosphate metabolic process tube morphogenesis ncRNA processing
heart development cellular anatomical entity morphogenesis heart morphogenesis rRNA processing muscle tissue development DNA damage response oxoacid metabolic process chromosome segregation cardiac muscle contraction NA cell junction organization peptide biosynthetic process cellular response to growth factor stimulus chromosome segregation NA monosaccharide metabolic process vasculature development translation
head development cell morphogenesis mesenchyme development ncRNA processing striated muscle tissue development chromatin organization organic acid metabolic process nuclear chromosome segregation regulation of muscle system process NA apoptotic signaling pathway amide biosynthetic process circulatory system process mitotic nuclear division NA NA blood vessel development rRNA metabolic process
gastrulation cell morphogenesis involved in neuron differentiation enzyme-linked receptor protein signaling pathway ribosomal small subunit biogenesis cardiac muscle tissue development protein-DNA complex organization carboxylic acid metabolic process cell division actin-mediated cell contraction NA enzyme-linked receptor protein signaling pathway transmembrane receptor protein serine/threonine kinase signaling pathway response to growth factor organelle fission NA NA muscle tissue development amide biosynthetic process
morphogenesis of an epithelium chemical synaptic transmission tube morphogenesis ribosomal large subunit biogenesis muscle contraction chromosome organization generation of precursor metabolites and energy nuclear division muscle contraction NA cell morphogenesis transforming growth factor beta receptor superfamily signaling pathway regulation of system process nuclear division NA NA muscle cell differentiation peptide biosynthetic process
tube morphogenesis anterograde trans-synaptic signaling mesenchymal cell differentiation maturation of SSU-rRNA actin filament-based process double-strand break repair monocarboxylic acid metabolic process sister chromatid segregation cardiac muscle cell contraction NA ameboidal-type cell migration enzyme-linked receptor protein signaling pathway enzyme-linked receptor protein signaling pathway mitotic sister chromatid segregation NA NA striated muscle tissue development rRNA processing
embryo development ending in birth or egg hatching trans-synaptic signaling embryonic morphogenesis peptide metabolic process muscle cell differentiation chromatin remodeling heart process regulation of cell cycle process regulation of system process NA cell adhesion mediated by integrin protein-lipid complex remodeling camera-type eye development sister chromatid segregation NA NA cardiac muscle tissue development DNA damage response
chordate embryonic development regulation of cell projection organization cellular anatomical entity morphogenesis peptide biosynthetic process striated muscle cell differentiation negative regulation of transcription by RNA polymerase II cardiac muscle contraction DNA replication regulation of muscle adaptation NA response to wounding plasma lipoprotein particle remodeling sensory system development nuclear chromosome segregation NA NA blood vessel morphogenesis chromosome organization

Generating Ranks

First, we will be excluding the patient samples from this analysis. They proved to be very problematic with fgsea.

# Set all the conditions involving patients to null
results_list$results_lvVS60 <- NULL
results_list$results_rvVS60 <- NULL
results_list$results_laVS60 <- NULL
results_list$results_raVS60 <- NULL
results_list$results_lvVSrv <- NULL

Now we can generate our ranks, and we’ll save them to a directory called ranks such that we can easily access them downstream during our Cytoscape analysis.

rank_and_write <- function(results_list, dir_name) {
  # Create the directory if it doesn't exist
  if (!dir.exists(dir_name)) {
    dir.create(dir_name)
  }
  
  # Iterate over each comparison in the results_list
  for (comparison in names(results_list)) {
    # Extract the data and check for any missing values in necessary columns
    results_table <- results_list[[comparison]]$table
    if (any(is.na(results_table$logFC)) || any(is.na(results_table$FDR))) {
      stop("Missing values found in logFC or FDR columns.")
    }
    
    # Calculate ranks
    ranks <- -1 * sign(results_table$logFC) * log10(results_table$FDR)
    
    # Add a very small random noise to the rank to break ties
    set.seed(123)  # Setting a seed for reproducibility
    noise <- runif(length(ranks), min=-1e-6, max=1e-6)
    ranks <- ranks + noise
    
    # Prepare data frame for writing
    ordered_results <- data.frame(
      GeneName = rownames(results_table),  # Use row names as GeneName
      rank = ranks
    )
    
    # Remove duplicate gene names if necessary
    ordered_results <- ordered_results[!duplicated(ordered_results$GeneName), ]

    # Order by rank and keep only the necessary columns
    ordered_results <- ordered_results[order(ordered_results$rank, decreasing = TRUE), ]
    
    # Write to .rnk file
    file_name <- paste0(comparison, "_ranks.rnk")
    write.table(ordered_results, quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE,
                file = file.path(dir_name, file_name))
  }
}

# Example usage
rank_and_write(results_list, "ranks")

Now that we’ve calculated our ranks, we can move on to performing fgsea.

Non-Thresholded GSEA

First, we’ll select a geneset for the fgsea analysis. We will be using the Bader Lab(bader_lab_2024?) geneset since this was recommended by the course instructor. Additionally, when we do Cytoscape downstream, we’ll also notice that we’ll be using the Bader Lab’s genesets again, so this works to streamline our results.

We’ll fetch the geneset first.

# Define the URL and the local filename
gmt_url <- "https://download.baderlab.org/EM_Genesets/April_01_2024/Human/symbol/Human_GOBP_AllPathways_noPFOCR_no_GO_iea_April_01_2024_symbol.gmt"
gmt_filename <- "Human_GOBP_AllPathways_noPFOCR_no_GO_iea_April_01_2024_symbol.gmt"

# Check if the file already exists
if (!file.exists(gmt_filename)) {
  # Download the file
  download.file(gmt_url, destfile = gmt_filename)
}

# Load the gene sets
gene_sets <- fgsea::gmtPathways(gmt_filename)

fgsea (Korotkevich et al. 2016) is a lightweight implementation of gsea that runs directly in R. I chose to use this since I was having trouble using the recommended GSEA analysis pipeline discussed in class. Additionally, the instructor had also remarked that she’d used fgsea as well, and results were comparable to the base implementation. We are using the newest version to date, 3.18. Therefore, I chose to use this platform.

Now, let’s define a function to run fgsea for each condition that we have in the results_list.

run_fgsea_for_all_conditions <- function(ranks_dir, gene_sets, useSimple = FALSE) {
  nPermSimple = ifelse(useSimple, 100000, NA) # This value is finnicky, adjust according to analysis
  
  # Get the rank files
  rnk_files <- list.files(ranks_dir, pattern = "\\.rnk$", full.names = TRUE)
  gsea_list <- list()

  for (rnk_file in rnk_files) {
    condition <- gsub("_ranks\\.rnk$", "", basename(rnk_file))

    # Skip atria and ventricle related results, if they haven't already been properly filtered out
    if (grepl("LV|RV|LA|RA", condition)) {
      cat("Skipping: ", condition, "\n")
      next
    }

    # This is added for a progress check since it takes long for fgsea to run, especially considering the high nPermSimple value
    print(paste("Working on:", condition))
    ranks <- read.table(rnk_file, header = TRUE, stringsAsFactors = FALSE)
    ranks_vector <- setNames(ranks$rank, ranks$GeneName)

    # Run fgsea
    fgsea_results <- if (useSimple) {
      fgsea::fgsea(pathways = gene_sets, stats = ranks_vector, minSize = 15, maxSize = 500, nPermSimple = nPermSimple)
    } else {
      fgsea::fgseaMultilevel(pathways = gene_sets, stats = ranks_vector, minSize = 15, maxSize = 500)
    }
    
    # Save our results to an gsea_list
    gsea_list[[condition]] <- fgsea_results
  }
  return(gsea_list)
}

Now, let’s get our list of fgsea values.

gsea_list <- run_fgsea_for_all_conditions("ranks", gene_sets, useSimple=TRUE)
## [1] "Working on: results_fifteenVS10"
## [1] "Working on: results_fortyVS20"
## [1] "Working on: results_fourVS2"
## [1] "Working on: results_sixtyVS0"
## [1] "Working on: results_sixtyVS40"
## [1] "Working on: results_sixVS4"
## [1] "Working on: results_tenVS6"
## [1] "Working on: results_twentyVS15"
## [1] "Working on: results_twoVS0"

Just as a double check to remove patient entries:

# Remove specific entries
gsea_list$results_lvVS60 <- NULL
gsea_list$results_rvVS60 <- NULL
gsea_list$results_lvVSrv <- NULL
gsea_list$results_laVS60 <- NULL
gsea_list$results_raVS60 <- NULL

Reformatting the fgsea Results for Cytoscape Analysis

fgsea formats its output differently from the GSEA technique discussed in lecture. To remedy this, instructor, Ruth Isserlin, provided a function on Quercus to perform this conversion. I’ve modified it quite a bit so that we can use it to display table data.

format_fgsea_results <- function(gsea_list, ranks_dir) {
  formatted_gsea_list <- list()
  
  # Load all ranks as a list of named vectors
  rank_files <- list.files(ranks_dir, pattern = "\\.rnk$", full.names = TRUE)
  ranks_list <- lapply(rank_files, function(file) {
    ranks <- read.table(file, header = TRUE, stringsAsFactors = FALSE)
    setNames(ranks$rank, ranks$GeneName)
  })
  names(ranks_list) <- gsub("_ranks\\.rnk$", "", basename(rank_files))

  for (analysis in names(gsea_list)) {
    fgsea_results <- gsea_list[[analysis]]
    ranks_vector <- ranks_list[[analysis]]  # Retrieve the corresponding rank vector
    
    if (length(fgsea_results) > 0 && !is.null(ranks_vector)) {
      calculated_rank_at_max <- sapply(fgsea_results$leadingEdge, function(leading_edge) {
        valid_genes = names(ranks_vector)[names(ranks_vector) %in% leading_edge]
        if (length(valid_genes) > 0) {
          return(max(ranks_vector[valid_genes]))
        } else {
          return(NA)  # Return NA if no valid genes
        }
      })
      
      gsea_results <- data.frame(
        name = fgsea_results$pathway,
        SIZE = fgsea_results$size,
        ES = fgsea_results$ES,
        NES = fgsea_results$NES,
        pval = fgsea_results$pval,
        padj = fgsea_results$padj,
        RankAtMax = calculated_rank_at_max,
        leadingEdgeGenes = sapply(fgsea_results$leadingEdge, paste, collapse = ",")
      )
      
      formatted_gsea_list[[analysis]] <- gsea_results
    } else {
      warning(paste("No valid fgsea results or ranks found for analysis:", analysis))
    }
  }
  
  return(formatted_gsea_list)
}

Now, we’ll format our results.

formatted_gsea_list <- format_fgsea_results(gsea_list, "ranks")

Results from GSEA

Now, let’s look at the results we got from our fgsea analysis.

Let’s break our results down into up and down-regulated analyses

formatted_gsea_df <- lapply(formatted_gsea_list, as.data.frame)

We’ll define a function to filter our pathways for significant up and down-regulated pathways and return both.

# Get all of the pathways with a positive normalization enrichment score that are significant
filter_pathways <- function(df) {
  # Filter for enrichment score in either direction and a adjusted p-value of 0.05
  upregulated_pathways <- df[df$NES > 0 & df$padj < 0.05, ]
  upregulated_pathways <- upregulated_pathways[order(upregulated_pathways$padj, abs(upregulated_pathways$NES), decreasing = c(FALSE, TRUE)),]
  
  downregulated_pathways <- df[df$NES < 0 & df$padj < 0.05, ]
  downregulated_pathways <- downregulated_pathways[order(downregulated_pathways$padj, abs(downregulated_pathways$NES), decreasing = c(FALSE, TRUE)),]
  
  # Return a list
  list(
    upregulated = upregulated_pathways,
    downregulated = downregulated_pathways
  )
}

filtered_pathways <- lapply(formatted_gsea_df, filter_pathways)

Now that we’ve filtered our pathways, we can now display our top 10 up and down-regulated pathways such that we can compare this to our GO analysis.

# Initialize lists for upregulated and downregulated pathways
upregulated_pathways_list <- list()
downregulated_pathways_list <- list()

# Populate the lists with corresponding entries from filtered_pathways
for(condition in names(filtered_pathways)) {
  upregulated_pathways_list[[condition]] <- filtered_pathways[[condition]][["upregulated"]]
  downregulated_pathways_list[[condition]] <- filtered_pathways[[condition]][["downregulated"]]
}

We can define a function to get the top 10 pathways for each condition. 10 pathways each that are most significant gives us a pretty good idea as to what’s changing pathway-wise between these conditions.

# Function to get the top 10 pathway names
get_top_pathways <- function(pathways_df) {
  if(nrow(pathways_df) >= 10) {
    return(pathways_df$name[1:10])
  } else {
    # If there are less than 10, return as many as there are and pad with NAs
    return(c(pathways_df$name, rep(NA, 10 - nrow(pathways_df))))
  }
}

# Applying the function to each list
top_upregulated_pathways <- lapply(upregulated_pathways_list, get_top_pathways)
top_downregulated_pathways <- lapply(downregulated_pathways_list, get_top_pathways)

Let’s do up first.

upregulated_pathways_df <- data.frame(top_upregulated_pathways, stringsAsFactors = FALSE)
upregulated_pathways_df <- upregulated_pathways_df[c("results_twoVS0", "results_fourVS2", "results_sixVS4", "results_tenVS6", "results_fifteenVS10", "results_twentyVS15", "results_fortyVS20", "results_sixtyVS40", "results_sixtyVS0")]

# Generate the table with kable
kable(upregulated_pathways_df, col.names = names(upregulated_pathways_df), caption = "Table 8: Top ranked pathways that are upregulated in pairwise timepoints from GSEA analysis.", align = 'c')
Table 8: Top ranked pathways that are upregulated in pairwise timepoints from GSEA analysis.
results_twoVS0 results_fourVS2 results_sixVS4 results_tenVS6 results_fifteenVS10 results_twentyVS15 results_fortyVS20 results_sixtyVS40 results_sixtyVS0
ANIMAL ORGAN MORPHOGENESIS%GOBP%GO:0009887 STRIATED MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0014706 MUSCLE STRUCTURE DEVELOPMENT%GOBP%GO:0061061 GENERATION OF PRECURSOR METABOLITES AND ENERGY%GOBP%GO:0006091 CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION%MSIGDBHALLMARK%HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION PROTEIN SYNTHESIS: ASPARTIC ACID%SMPDB%SMP0111858 POTASSIUM ION IMPORT ACROSS PLASMA MEMBRANE%GOBP%GO:1990573 CIRCULATORY SYSTEM DEVELOPMENT%GOBP%GO:0072359
CIRCULATORY SYSTEM DEVELOPMENT%GOBP%GO:0072359 CARDIAC MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0048738 MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 AEROBIC RESPIRATION AND RESPIRATORY ELECTRON TRANSPORT%REACTOME%R-HSA-1428517.4 HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 INORGANIC ION TRANSMEMBRANE TRANSPORT%GOBP%GO:0098660 MUSCLE STRUCTURE DEVELOPMENT%GOBP%GO:0061061
HEART DEVELOPMENT%GOBP%GO:0007507 MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0060537 REGULATION OF SYSTEM PROCESS%GOBP%GO:0044057 ENERGY DERIVATION BY OXIDATION OF ORGANIC COMPOUNDS%GOBP%GO:0015980 CHROMOSOME ORGANIZATION%GOBP%GO:0051276 MITOTIC METAPHASE AND ANAPHASE%REACTOME DATABASE ID RELEASE 81%2555396 PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 NEGATIVE REGULATION OF PROTEIN SECRETION%GOBP%GO:0050709 HEART DEVELOPMENT%GOBP%GO:0007507
ANATOMICAL STRUCTURE FORMATION INVOLVED IN MORPHOGENESIS%GOBP%GO:0048646 CARDIAC CHAMBER DEVELOPMENT%GOBP%GO:0003205 MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0060537 MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 HDR THROUGH HOMOLOGOUS RECOMBINATION (HRR)%REACTOME DATABASE ID RELEASE 81%5685942 MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 81%68882 PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 MONOATOMIC ION TRANSMEMBRANE TRANSPORT%GOBP%GO:0034220 EXTRACELLULAR MATRIX ORGANIZATION%REACTOME%R-HSA-1474244.4
EMBRYO DEVELOPMENT%GOBP%GO:0009790 CARDIAC SEPTUM DEVELOPMENT%GOBP%GO:0003279 MUSCLE CONTRACTION%GOBP%GO:0006936 HALLMARK_MYOGENESIS%MSIGDBHALLMARK%HALLMARK_MYOGENESIS NUCLEAR CHROMOSOME SEGREGATION%GOBP%GO:0098813 CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDBHALLMARK%HALLMARK_INTERFERON_ALPHA_RESPONSE MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0060537
EMBRYONIC MORPHOGENESIS%GOBP%GO:0048598 CELLULAR COMPONENT ASSEMBLY INVOLVED IN MORPHOGENESIS%GOBP%GO:0010927 MUSCLE CELL DIFFERENTIATION%GOBP%GO:0042692 PURINE NUCLEOTIDE METABOLIC PROCESS%GOBP%GO:0006163 CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS PROTEIN SYNTHESIS: SERINE%PATHWHIZ%PW120517 NA REGULATION OF SYSTEM PROCESS%GOBP%GO:0044057
TUBE DEVELOPMENT%GOBP%GO:0035295 CARDIAC CELL DEVELOPMENT%GOBP%GO:0055006 ACTIN CYTOSKELETON ORGANIZATION%GOBP%GO:0030036 PURINE-CONTAINING COMPOUND METABOLIC PROCESS%GOBP%GO:0072521 MITOTIC CELL CYCLE PROCESS%GOBP%GO:1903047 SEPARATION OF SISTER CHROMATIDS%REACTOME DATABASE ID RELEASE 81%2467813 PROTEIN SYNTHESIS: TRYPTOPHAN%PATHWHIZ%PW120526 NA MUSCLE SYSTEM PROCESS%GOBP%GO:0003012
TISSUE MORPHOGENESIS%GOBP%GO:0048729 OUTFLOW TRACT MORPHOGENESIS%GOBP%GO:0003151 CELLULAR COMPONENT ASSEMBLY INVOLVED IN MORPHOGENESIS%GOBP%GO:0010927 PURINE RIBONUCLEOTIDE METABOLIC PROCESS%GOBP%GO:0009150 REGULATION OF BLOOD CIRCULATION%GOBP%GO:1903522 SYNTHESIS OF DNA%REACTOME DATABASE ID RELEASE 81%69239 PROTEIN SYNTHESIS: ISOLEUCINE%SMPDB%SMP0111872 NA ANIMAL ORGAN MORPHOGENESIS%GOBP%GO:0009887
EPITHELIUM DEVELOPMENT%GOBP%GO:0060429 CARDIAC MUSCLE CELL DEVELOPMENT%GOBP%GO:0055013 STRIATED MUSCLE TISSUE DEVELOPMENT%GOBP%GO:0014706 AEROBIC RESPIRATION%GOBP%GO:0009060 S PHASE%REACTOME DATABASE ID RELEASE 81%69242 RHO GTPASE CYCLE%REACTOME DATABASE ID RELEASE 81%9012999 PROTEIN SYNTHESIS: LEUCINE%SMPDB%SMP0111873 NA MUSCLE CONTRACTION%GOBP%GO:0006936
REGIONALIZATION%GOBP%GO:0003002 CARDIAC CHAMBER MORPHOGENESIS%GOBP%GO:0003206 HALLMARK_MYOGENESIS%MSIGDBHALLMARK%HALLMARK_MYOGENESIS HALLMARK_OXIDATIVE_PHOSPHORYLATION%MSIGDBHALLMARK%HALLMARK_OXIDATIVE_PHOSPHORYLATION MUSCLE SYSTEM PROCESS%GOBP%GO:0003012 TNFR2 NON-CANONICAL NF-KB PATHWAY%REACTOME%R-HSA-5668541.5 PROTEIN SYNTHESIS: ASPARAGINE%SMPDB%SMP0111854 NA MUSCLE CELL DIFFERENTIATION%GOBP%GO:0042692

And now down.

downregulated_pathways_df <- data.frame(top_downregulated_pathways, stringsAsFactors = FALSE)
downregulated_pathways_df <- downregulated_pathways_df[c("results_twoVS0", "results_fourVS2", "results_sixVS4", "results_tenVS6", "results_fifteenVS10", "results_twentyVS15", "results_fortyVS20", "results_sixtyVS40", "results_sixtyVS0")]

# Generate the table with kable
kable(downregulated_pathways_df, col.names = names(downregulated_pathways_df), caption = "Table 9: Top ranked pathways that are downregulated in pairwise timepoints from GSEA analysis.", align = 'c')
Table 9: Top ranked pathways that are downregulated in pairwise timepoints from GSEA analysis.
results_twoVS0 results_fourVS2 results_sixVS4 results_tenVS6 results_fifteenVS10 results_twentyVS15 results_fortyVS20 results_sixtyVS40 results_sixtyVS0
HALLMARK_MTORC1_SIGNALING%MSIGDBHALLMARK%HALLMARK_MTORC1_SIGNALING RIBOSOME BIOGENESIS%GOBP%GO:0042254 DNA REPAIR%GOBP%GO:0006281 CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 NCRNA METABOLIC PROCESS%GOBP%GO:0034660 PROTEIN SYNTHESIS: ASPARTIC ACID%SMPDB%SMP0111858 CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 CYTOPLASMIC TRANSLATION%GOBP%GO:0002181 RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613
MODULATION OF CHEMICAL SYNAPTIC TRANSMISSION%GOBP%GO:0050804 RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 PROCESSING OF CAPPED INTRON-CONTAINING PRE-MRNA%REACTOME DATABASE ID RELEASE 81%72203 MITOTIC CELL CYCLE PROCESS%GOBP%GO:1903047 NCRNA PROCESSING%GOBP%GO:0034470 PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 M PHASE%REACTOME%R-HSA-68886.5 PROTEIN SYNTHESIS: GLUTAMINE%SMPDB%SMP0111862 CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278
REGULATION OF TRANS-SYNAPTIC SIGNALING%GOBP%GO:0099177 NCRNA METABOLIC PROCESS%GOBP%GO:0034660 HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS CHROMOSOME ORGANIZATION%GOBP%GO:0051276 TRANSLATION%GOBP%GO:0006412 PROTEIN SYNTHESIS: LEUCINE%SMPDB%SMP0111873 MITOTIC CELL CYCLE PROCESS%GOBP%GO:1903047 PROTEIN SYNTHESIS: VALINE%PATHWHIZ%PW120528 NCRNA METABOLIC PROCESS%GOBP%GO:0034660
CHOLESTEROL BIOSYNTHESIS%REACTOME DATABASE ID RELEASE 81%191273 NCRNA PROCESSING%GOBP%GO:0034470 CELL CYCLE, MITOTIC%REACTOME DATABASE ID RELEASE 81%69278 M PHASE%REACTOME%R-HSA-68886.5 POST-TRANSCRIPTIONAL REGULATION OF GENE EXPRESSION%GOBP%GO:0010608 PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 PROTEIN SYNTHESIS: LYSINE%SMPDB%SMP0111874 PROCESSING OF CAPPED INTRON-CONTAINING PRE-MRNA%REACTOME DATABASE ID RELEASE 81%72203
LONG-TERM SYNAPTIC POTENTIATION%GOBP%GO:0060291 RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 CHROMOSOME ORGANIZATION%GOBP%GO:0051276 HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 81%68882 PROTEIN SYNTHESIS: CYSTEINE%PATHWHIZ%PW112918 NCRNA PROCESSING%GOBP%GO:0034470
TRANSPORT OF INORGANIC CATIONS ANIONS AND AMINO ACIDS OLIGOPEPTIDES%REACTOME%R-HSA-425393.4 RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620 RIBOSOME BIOGENESIS%GOBP%GO:0042254 PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 MITOTIC METAPHASE AND ANAPHASE%REACTOME DATABASE ID RELEASE 81%2555396 PROTEIN SYNTHESIS: SERINE%PATHWHIZ%PW120517 CHROMOSOME ORGANIZATION%GOBP%GO:0051276
SUPERPATHWAY OF CHOLESTEROL BIOSYNTHESIS%BIOCYC%PWY66-5 RRNA METABOLIC PROCESS%GOBP%GO:0016072 RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 CHROMOSOME SEGREGATION%GOBP%GO:0007059 RIBONUCLEOPROTEIN COMPLEX BIOGENESIS%GOBP%GO:0022613 PROTEIN SYNTHESIS: LYSINE%SMPDB%SMP0111874 CHROMOSOME ORGANIZATION%GOBP%GO:0051276 PROTEIN SYNTHESIS: ALANINE%PATHWHIZ%PW101384 CELL CYCLE CHECKPOINTS%REACTOME DATABASE ID RELEASE 81%69620
POSITIVE REGULATION OF SYNAPTIC TRANSMISSION%GOBP%GO:0050806 RRNA PROCESSING%GOBP%GO:0006364 RRNA PROCESSING%REACTOME DATABASE ID RELEASE 81%72312 DNA REPAIR%GOBP%GO:0006281 RRNA PROCESSING IN THE NUCLEUS AND CYTOSOL%REACTOME DATABASE ID RELEASE 81%8868773 PROTEIN SYNTHESIS: ASPARAGINE%SMPDB%SMP0111854 HALLMARK_E2F_TARGETS%MSIGDBHALLMARK%HALLMARK_E2F_TARGETS PROTEIN SYNTHESIS: GLUTAMIC ACID%PATHWHIZ%PW112922 MRNA METABOLIC PROCESS%GOBP%GO:0016071
BLOOD GROUP SYSTEMS BIOSYNTHESIS%REACTOME%R-HSA-9033658.3 MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 DNA REPLICATION%GOBP%GO:0006260 MITOTIC METAPHASE AND ANAPHASE%REACTOME DATABASE ID RELEASE 81%2555396 MRNA METABOLIC PROCESS%GOBP%GO:0016071 PROTEIN SYNTHESIS: SERINE%PATHWHIZ%PW120517 CHROMOSOME SEGREGATION%GOBP%GO:0007059 PROTEIN SYNTHESIS: PROLINE%PATHWHIZ%PW113695 M PHASE%REACTOME%R-HSA-68886.5
CALCIUM ION TRANSMEMBRANE IMPORT INTO CYTOSOL%GOBP%GO:0097553 HALLMARK_MYC_TARGETS_V1%MSIGDBHALLMARK%HALLMARK_MYC_TARGETS_V1 MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 81%68882 MAJOR PATHWAY OF RRNA PROCESSING IN THE NUCLEOLUS AND CYTOSOL%REACTOME%R-HSA-6791226.5 PROTEIN SYNTHESIS: ISOLEUCINE%SMPDB%SMP0111872 NUCLEAR CHROMOSOME SEGREGATION%GOBP%GO:0098813 PROTEIN SYNTHESIS: TRYPTOPHAN%PATHWHIZ%PW120526 RIBOSOME BIOGENESIS%GOBP%GO:0042254

Summary of GSEA results and Comparison to g:Profiler Results

The results we see here are extremely similar to the results that we got when we ran g:Profiler.

For upregulation, we see that at the beginning, we have a lot of developmental features, including tissue morphogenesis, cardiac and muscle tissue formation, structure formation and differentiation. We see that at the 40vs20 comparison, we see that all pathways are related to amino acid synthesis, which probably indicates that at this stage, the iPSCs are undergoing massive amount of protein synthesis. This could be related to the final bits of differentiation it will undergo.

I will mention, however, that the 40vs20 for the g:Profiler results are radically different, with them focussion more on ECM organization and response regulation.

I assume that this discrepancy likely comes down to the different annotation and geneset sources used, as variations in this results can affect this output.

In downregulation, the early stages seem to involve decreasing synaptic transmission, ribosomal development, DNA synthesis and repair, cell cycle elements. Interestingly, we actually see that the protein synthesis pathways seen upregulated in 40v20 are downregulated at 20v15. This likely means that there was a cellular event during differentiation where protein synthesis was heavily dis-favoured and after this event, amino acid synthesis ramped back up, presumbaly to meet the cells’ protein demands. Towards the tail end are just downregulation in chromosomal features and various metabolic pathways.

Once again, these results closely match the g:Profiler results. This is qualitative, however, since different parameters were used for both analyses, and the analyses itself is akin to apples and pears (same type of fruit-ish, but vastly different). Additionally, different reference sets were used, so this further pushes the idea that these results aren’t directly comparable.

Cytoscape Analysis

We can proceed with our Cytoscape analysis. Cytoscape (Shannon et al. 2003) is a piece of software that allows for graph-based views of lists. It was initially developed with the bioinformatics community in mind, and allows for pathway analysis via connection nodes that cluster according to gene sets. This is a powerful, graphical way of analyzing biological data in addition to what we’ve already done.

We’ve been following a bit of a pre-built pipeline from a paper published by Ruth and colleagues(Reimand et al. 2019). First, we will modify and clean up the code she provided to format the fgsea results into a format that Cytoscape will take in. Then, we’ll make our networks using Cytoscape.

As an aside, I realize that I’ve routinely had among the longest analyses of all students in the course since I did a longer time-series. As a result, I’m focussing on three timepoints instead, 60 vs 40, 40 vs 20, and 20 vs 15. Additionally, I also made a network for 60 vs 0 to compare the differences from the beginning to the end of the analysis.

Preparing GSEA Results

First, let’s get our GSEA results ready.

# The following code is courtesy of Ruth Isserlin
# It was updated and formatted for better readability
format_fgsea_results2 <- function(current_fgsea_results, current_ranks) {
    # Calculate the rank at max
    # fgsea returns the leading edge
    # We need to extract the highest rank from the set to get the rank at max
    calculated_rank_at_max <- apply(
      current_fgsea_results,
      1,
      FUN=function(x){
        max(which(names(current_ranks)
                  %in%
                    unlist(x[8])
                  )
            )
        }
      )
    

    # The last column is a comma separated list of genes that are found in the leading edge.
    # And the column will be ignored but you have to put something in that column, might as put something that might be useful
    gsea_results <- cbind(current_fgsea_results$pathway,
                                     current_fgsea_results$pathway,
                                     "Details",
                                     current_fgsea_results$size,
                                     current_fgsea_results$ES,
                                     current_fgsea_results$NES,
                                     current_fgsea_results$pval,
                                     current_fgsea_results$padj,
                                     0,
                                     calculated_rank_at_max,
                                     apply(current_fgsea_results,1,
                                           FUN=function(x){
                                             paste(unlist(x[8]), collapse=",")
                                             }
                                           )
                          ) 
    
    colnames(gsea_results) <- c("name",
                                "description",
                                "GS details",
                                "SIZE",
                                "ES",
                                "NES",
                                "pval",
                                "padj",
                                "FWER",
                                "Rank at Max",
                                "leading edge genes")
    
   return(gsea_results)
}

Now, let’s create a list of our Cytoscape-ready GSEA results

# Function to load ranks from a file
load_ranks <- function(rank_file) {
  ranks <- fread(rank_file, header = TRUE, sep = "\t")
  setNames(ranks$rank, ranks$GeneName)
}

# Function to process GSEA results and save them to TSV files
process_and_save_gsea_results <- function(gsea_results, rank_file, output_dir) {
  # Load ranks
  current_ranks <- load_ranks(rank_file)
  
  # Format the fgsea results using the provided function
  formatted_results <- as.data.frame(format_fgsea_results2(gsea_results, current_ranks))
  
  # Filter for positive and negative results based on NES and adjusted p-value
  positive_results <- formatted_results[formatted_results$NES > 0 & formatted_results$padj < 0.05, ]
  negative_results <- formatted_results[formatted_results$NES < 0 & formatted_results$padj < 0.05, ]
  
  # Generate output file paths
  base_name <- gsub("\\.rnk$", "", basename(rank_file))
  positive_file_path <- file.path(output_dir, paste0(base_name, "_positive.tsv"))
  negative_file_path <- file.path(output_dir, paste0(base_name, "_negative.tsv"))
  
  # Write the results to TSV files for Cytoscape import
  fwrite(positive_results, positive_file_path, sep = "\t", quote = FALSE)
  fwrite(negative_results, negative_file_path, sep = "\t", quote = FALSE)
}

# Specify the directories
rank_dir <- "ranks"
output_dir <- "Cytoscape_inputs"
dir.create(output_dir, showWarnings = FALSE)

# List all rank files and process them
rank_files <- list.files(rank_dir, pattern = "\\.rnk$", full.names = TRUE)
for (i in seq_along(rank_files)) {
  process_and_save_gsea_results(gsea_list[[i]], rank_files[i], output_dir)
}

Cytoscape Disclaimer

As a note, the following analyses were all conducted in Cytoscape. We installed the AutoAnnotate app so that we could do automatic annotations of our data.

Creating Enrichment Maps

Figure 1A: Day 60 to Day 0 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1A: Day 60 to Day 0 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1B: Day 60 to Day 40 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1B: Day 60 to Day 40 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1C: Day 40 to Day 20 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1C: Day 40 to Day 20 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1D: Day 20 to Day 15 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.
Figure 1D: Day 20 to Day 15 comparison. The network was not at all manipulated, since that analysis comes downstream. Explanation will follow.

All analyses used a default parameter of Q-value = 0.1 and Edge Cutoff = 0.375.

We’ll also display the number of nodes and edges for each

network_table <- data.frame(
  condition = c("60 VS 0", "60 VS 40", "40 VS 20", "20 VS 15"),
  nodes = c(974, 76, 632, 1241),
  edges = c(2658, 1796, 3263, 12571)
)

# Use kable to display the table
kable(network_table, caption = "Table x: Network Analysis Results", align = 'c')
Table x: Network Analysis Results
condition nodes edges
60 VS 0 974 2658
60 VS 40 76 1796
40 VS 20 632 3263
20 VS 15 1241 12571

Annotating our Enrichment Maps

We next used AutoAnnotate to annotate our data.

The clusterMaker App was used.

The Cluster algorithm was MCL Cluster

The Edge weight column used the similarity-coefficient.

Singleton clusters were created.

The networks were laid out such that they would try to prevent cluster overlap.

Figure 2A: Day 60 to Day 0 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2A: Day 60 to Day 0 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2B: Day 60 to Day 40 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2B: Day 60 to Day 40 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2C: Day 40 to Day 20 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2C: Day 40 to Day 20 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2D: Day 20 to Day 15 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.
Figure 2D: Day 20 to Day 15 comparison. AutoAnnotation was used to annotate clustered pathways. These will be collapsed in the following step to allow us to better interpret.

Publication Ready Figures

To do this, the Publication-Ready box was checked off in the EnrichmentMap panel. Legends were added as well.

Figure 3A: Day 60 to Day 0 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3A: Day 60 to Day 0 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3B: Day 60 to Day 40 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3B: Day 60 to Day 40 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3C: Day 40 to Day 20 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3C: Day 40 to Day 20 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3D: Day 20 to Day 15 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.
Figure 3D: Day 20 to Day 15 comparison. Individual pathway labels were removed to generate a cleaner-looking plot to look good for publication.

Collapsing the Networks into Theme Networks

To do this, the annotations were all selected for, right-clicked on, and the option to create summary networks came up. They were then adjusted to look better.

Figure 4A: Day 60 to Day 0 comparison. This is a very large network due to it comparing differences at the beginning and the end of the analysis. We see a few connected clusters. In the top left, we see a cluster involving membrane polarization, presumably related to the membrane polarization that CMs need. In the middle-top, we see clusters regarding morphogeneses upregulated, which make sense, since these are iPSCs differentiating into CMs. Below that is a cluster regarding DNA synthesis being down-regulated, which fits in with the previous analyses.
Figure 4A: Day 60 to Day 0 comparison. This is a very large network due to it comparing differences at the beginning and the end of the analysis. We see a few connected clusters. In the top left, we see a cluster involving membrane polarization, presumably related to the membrane polarization that CMs need. In the middle-top, we see clusters regarding morphogeneses upregulated, which make sense, since these are iPSCs differentiating into CMs. Below that is a cluster regarding DNA synthesis being down-regulated, which fits in with the previous analyses.
Figure 4B: Day 60 to Day 40 comparison. This is a tiny network with no edges. Everything is segregated, though, what is up and down-regulated agree with GSEA and g:Profiler results.
Figure 4B: Day 60 to Day 40 comparison. This is a tiny network with no edges. Everything is segregated, though, what is up and down-regulated agree with GSEA and g:Profiler results.
Figure 4C: Day 40 to Day 20 comparison. In this network, we see a lot of inter-connections for downregulated pathways involving telomeres. This makes sense, since iPSCs are ‘immortal’ cells, and once they differentiate, telomerase activity should go down accordingly.
Figure 4C: Day 40 to Day 20 comparison. In this network, we see a lot of inter-connections for downregulated pathways involving telomeres. This makes sense, since iPSCs are ‘immortal’ cells, and once they differentiate, telomerase activity should go down accordingly.
Figure 4D: Day 20 to Day 15 comparison. Here we see a few clusters involving integrins, membrane, and cadherins, which make sense since it involves the cells growing on their plates and trying to assume the LV matrix form. There’s also an uptick in mitotis-related pathways.
Figure 4D: Day 20 to Day 15 comparison. Here we see a few clusters involving integrins, membrane, and cadherins, which make sense since it involves the cells growing on their plates and trying to assume the LV matrix form. There’s also an uptick in mitotis-related pathways.

All these results discussed in the figure headings agree with the previous two analyses done, as discussed in all of the figure headings.

I will say, seeing that telomerase activity was downregulated according to the GSEA + Cytoscape analysis for the day 40-20 comparison was extremely insightful, because it would make perfect sense that telomerase activity would go down as cells go from an immortal, stem-state, to a differentiated cell type.

Interpretation

Do the enrichment results support conclusions or mechanism discussed in the original paper?

As a note, some of the interpretation is taken, used, and modified from A2.

We once again face the same issue as A2 in determining this, because the authors were using other datasets to compare differentiation protocols to see whether or not using a Wnt inhibitor while directing differentiation into cardiomyocytes would cause their hPSCs to be more akin to adult LV cardiomyocytes. Our data and analysis is restricted to just this dataset, so all we can comment on is what we found. With this in mind, we can begin to answer this question. In our pairwise day comparisons we indeed see many of the same terms being upregulated in Figure 2C of the article. This indicates that, for upregulation at the very least, we have similar results. These include terms like muscle tissue development, actin-mediated cell contraction, and more. Additionally, we also get some similar terms for down-regulation, including ncRNA metabolic process and ribosome biogenesis. What we don’t see at all, however, are the down-regulation results, which we’ve found to be potentially interesting, including things like down-regulation of amino acid synthesis pathways and telomerase pathways. The telomerase pathway finding likely supports the conclusions made by the article. Together, we see that our analysis concurs with the research group’s, such that the protocol indeed up-regulated pathways related to creating cardiomyocytes and downregulated ‘general’ and ‘stem-my’ pathways as it differentiated.

Additionally, I would also say that the results of GSEA and GSEA + Cytoscape align well with those found in assignment 2. This likely indicates that our findings regarding the data were valid, and since we also line up with the original paper well, there’s reason to believe that we completed a robust total analysis. Additionally, it also indicates that the data used was robust, and that all pathway databases used were complete.

Can you find evidence, i.e. publications, to support some of the results that you see? How does this evidence support your results?

We will reiterate our comparison from A2 since it agrees with A3:

The article’s main goal was to show that a methodology could be derived to improve the differentiation of hPSCs into cardiomyocytes more effectively, and their choice was to using CHIR to inhibit Wnt. When the authors compared their dataset to Cyganek et al.’s, they actually found that the other group’s day 90 compared favourably to our group’s day 60, indicating that our group’s differentiation protocol was good at making mature cardiomyocytes more quickly. Now, for our analysis, it would have been nice for us to have been able to compare our analysis to the other group’s day 90 to see if we had similar ORA results, and to see if we ran a GSEA result on their data, to see if it could match our GSEA results. Additionally, in a study looking at NOTCH1 knockouts by Ye et al., they also conducted a overrepresentation analysis, and in their day 10 cardiomyocytes, they had a similar result where pathways upregulating heart muscle development were favoured, and pathways involved in the cell cycle were downregulated. Together, this increases the confidence that our results show a robust differentiation protocol of hPSCs in cardiomyocytes.

For our new additions, I’d like to focus on telomerase activity. In a study by Armstrong et al (Armstrong et al. 2005), the group also found that telomerase activity decreased as differentiation proceeded. As a result, I am far more confident regarding this finding.

While Killberg et al.(Kilberg, Terada, and Shan 2016) published this as a review article, they discuss the important of amino acids as a form of metabolism. If this is the case, it stands to be true that perhaps the cells were downregulating amino acid synthesis due to an over-abundance, and potentially a switch to amino acid metabolism so that they could effectively use the excess amount of amino acids in vivo. Further analysis in amino acid metabolism pathways would likely be fruitful here.

Post-Analysis using Transcription Factors

I chose to look at specific transcription factors that may be related to our networks here. I focussed on a range of them, including ADA2 and FOXE1 since my former PI would mention them so frequently in regards to this project.

For ADA2, it serves as a growth factor, and since it’s a deaminase(Lee et al. 2020), it could be involved in the aforementioned drop for protein synthesis.

For FOXE1, a previous study found that it was involved in thyroid morphogenesis as it controls differentiation(Zannini et al. 1997), but it caught my attention as a top hit in our analysis, so I’m including it.

Since my initial project was interested in finding transcription factors that could be used at every stage of differentiation, I decided to limit it to pairwise comparison (i.e., not the 60 vs 0).

To do the following analyses, I used a two-sided Mann-Whitney. I took the top 5-15 results from each because it got extremely messy otherwise.

Figure 5A: Day 60 to Day 40 comparison. All the TFs seem to be acting on dependent protein synthesis, including ADA2 and FOXE1, indicating that they’re linked in the downregulation of dependent protein synthesis.
Figure 5A: Day 60 to Day 40 comparison. All the TFs seem to be acting on dependent protein synthesis, including ADA2 and FOXE1, indicating that they’re linked in the downregulation of dependent protein synthesis.
Figure 5B: Day 40 to Day 20 comparison. Again, the same two seem to all go towards sars cov translation. This cluster is actually refering to protein synthesis and a few metabolic pathways, but has up and down-regulated pathways.
Figure 5B: Day 40 to Day 20 comparison. Again, the same two seem to all go towards sars cov translation. This cluster is actually refering to protein synthesis and a few metabolic pathways, but has up and down-regulated pathways.
Figure 5C: See below
Figure 5C: See below

Day 20 to Day 15 comparison. The aforementioned TFs don’t appear here, but SUPT20H(Gomes et al. 2002), a TF involved in antigen-positive hematopoietic stem cells present atgli3 proteasome degradation.

I admittedly did not get as much as I wanted to out of this analysis. What we can immediately see is that ADA2 and FOXE1 are both TFs involved protein synthesis and downregulation. This however does show that there are many transcription factors that are significant and contribute to the modulation of protein synthesis, which could prove to be integral to iPSC to CM differentiation.

References

Armstrong, L, G Saretzki, H Peters, I Wappler, J Evans, N Hole, T von Zglinicki, and M Lako. 2005. “Overexpression of Telomerase Confers Growth Advantage, Stress Resistance, and Enhanced Differentiation of ESCs Toward the Hematopoietic Lineage.” Stem Cells 23 (4): 516–29.
Dark, Nicola, Marie-Victoire Cosson, Lorenza I Tsansizi, Thomas J Owen, Elisa Ferraro, Alice J Francis, Selina Tsai, et al. 2023. “Generation of Left Ventricle-Like Cardiomyocytes with Improved Structural, Functional, and Metabolic Maturity from Human Pluripotent Stem Cells.” Cell Rep Methods 3 (4): 100456–56. https://doi.org/https://doi.org/10.1016/j.crmeth.2023.100456.
Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.
Gomes, Ignatius, Tiffany T Sharma, Seby Edassery, Noreen Fulton, Brenton G Mar, and Carol A Westbrook. 2002. “Novel Transcription Factors in Human CD34 Antigen-Positive Hematopoietic Cells.” Blood 100 (1): 107–19.
Kilberg, Michael S, Naohiro Terada, and Jixiu Shan. 2016. “Influence of Amino Acid Metabolism on Embryonic Stem Cell Function and Differentiation.” Adv. Nutr. 7 (4): 780S–789S.
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N Artyomov, and Alexey Sergushichev. 2016. “Fast Gene Set Enrichment Analysis.” bioRxiv. bioRxiv.
Lee, Pui Y, Erinn S Kellner, Yuelong Huang, Elissa Furutani, Zhengping Huang, Wayne Bainter, Mohammed F Alosaimi, et al. 2020. “Genotype and Functional Correlates of Disease Phenotype in Deficiency of Adenosine Deaminase 2 (DADA2).” J. Allergy Clin. Immunol. 145 (6): 1664–1672.e10.
Reimand, Jüri, Ruth Isserlin, Veronique Voisin, Mike Kucera, Christian Tannus-Lopes, Asha Rostamianfar, Lina Wadi, et al. 2019. “Pathway Enrichment Analysis and Visualization of Omics Data Using g:profiler, GSEA, Cytoscape and EnrichmentMap.” Nat. Protoc. 14 (2): 482–517.
Reimand, Jüri, Meelis Kull, Hedi Peterson, Jaanus Hansen, and Jaak Vilo. 2007. “G:profiler–a Web-Based Toolset for Functional Profiling of Gene Lists from Large-Scale Experiments.” Nucleic Acids Res. 35 (Web Server issue): W193–200.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Res. 13 (11): 2498–2504.
Tweedie, Susan, Bryony Braschi, Kristian Gray, Tamsin E M Jones, Ruth L Seal, Bethan Yates, and Elspeth A Bruford. 2020. “Genenames.org: The HGNC and VGNC Resources in 2021.” Nucleic Acids Research 49 (D1): D939–46. https://doi.org/https://doi.org/10.1093/nar/gkaa980.
Zannini, M, V Avantaggiato, E Biffali, M I Arnone, K Sato, M Pischetola, B A Taylor, S J Phillips, A Simeone, and R Di Lauro. 1997. TTF-2, a New Forkhead Protein, Shows a Temporal Expression in the Developing Thyroid Which Is Consistent with a Role in Controlling the Onset of Differentiation.” EMBO J. 16 (11): 3185–97.